According to the Centers for Disease Control and Prevention (CDC), cardiovascular (heart) disease accounts for 1 in every 4 deaths in the United States annually. The estimated cost to the nation’s economy is about $363 billion per year, which includes hospitalizations, post-acute health care needs, and lost productivity due to death and disability (Virani et al., 2021). The CDC also states that, given the lifestyle habits of the average American, nearly half of the population (47%) is reported to have “1 of the 3 key risk factors for heart disease: high blood pressure, high cholesterol, and smoking”. However, the risk factors for developing heart disease are vast and diverse, and they vary across racial and ethnic groups in the United States.
Therefore, there is an immediate need to detect the major risk factors early, so that a wellness-oriented healthcare system can be instituted. Ideally, such a system would assess the intensity of the risk factors against a baseline and focus its efforts on preventing heart disease from occurring in the first place rather than on minimizing recurrences (Liau et al., 2010; Giampaoli et al., 2005).
The beginnings of the current understanding of the interconnected role of risk factors were first discussed in the Framingham Heart Study (FHS). Started in 1948, the study’s original goal was “to identify common factors or characteristics that contribute to cardiovascular disease” (Hong 2018). Today, however, the FHS has morphed into a multigenerational study gathering genetic information that helps analyze family patterns of certain diseases, including cardiovascular ailments (National Heart, Lung, and Blood Institute). Building on this, current research in the field has focused on combining precision-medicine algorithms with computational tools to assess a patient’s likelihood of developing heart disease given certain lifestyle factors. Most studies, however, have drawn their samples largely from Caucasian individuals, indicating a gap in how well current risk-assessment tools work for individuals of other races.
In this paper, we aim to understand the different risk factors that affect the occurrence of heart disease among the US population, so that we can examine how those risk factors intermingle.
The original data set can be retrieved from the CDC’s Behavioral Risk Factor Surveillance System (BRFSS) annual health survey, which interviews over four hundred thousand individuals in the United States, asking questions about the risk factors for heart disease. Conducted in 2020 (published on February 15, 2021), the original data set was reduced from its original 300 variables to 18 variables by Kamil Pytlak and can be accessed on Kaggle.
# Load the data and required packages
library(funModeling)  # provides status(), plot_num(), and freq()
library(dplyr)        # provides %>%, group_by(), summarise_at()
heart <- read.csv("heart_2020_cleaned.csv")
# Descriptive statistics
status(heart)
## variable q_zeros p_zeros q_na p_na q_inf p_inf
## HeartDisease HeartDisease 0 0.000 0 0 0 0
## BMI BMI 0 0.000 0 0 0 0
## Smoking Smoking 0 0.000 0 0 0 0
## AlcoholDrinking AlcoholDrinking 0 0.000 0 0 0 0
## Stroke Stroke 0 0.000 0 0 0 0
## PhysicalHealth PhysicalHealth 226589 0.709 0 0 0 0
## MentalHealth MentalHealth 205401 0.642 0 0 0 0
## DiffWalking DiffWalking 0 0.000 0 0 0 0
## Sex Sex 0 0.000 0 0 0 0
## AgeCategory AgeCategory 0 0.000 0 0 0 0
## Race Race 0 0.000 0 0 0 0
## Diabetic Diabetic 0 0.000 0 0 0 0
## PhysicalActivity PhysicalActivity 0 0.000 0 0 0 0
## GenHealth GenHealth 0 0.000 0 0 0 0
## SleepTime SleepTime 0 0.000 0 0 0 0
## Asthma Asthma 0 0.000 0 0 0 0
## KidneyDisease KidneyDisease 0 0.000 0 0 0 0
## SkinCancer SkinCancer 0 0.000 0 0 0 0
## type unique
## HeartDisease character 2
## BMI numeric 3604
## Smoking character 2
## AlcoholDrinking character 2
## Stroke character 2
## PhysicalHealth integer 31
## MentalHealth integer 31
## DiffWalking character 2
## Sex character 2
## AgeCategory character 13
## Race character 6
## Diabetic character 4
## PhysicalActivity character 2
## GenHealth character 5
## SleepTime integer 24
## Asthma character 2
## KidneyDisease character 2
## SkinCancer character 2
From the table we can see that PhysicalHealth and MentalHealth have a high percentage of zeros. Just over 70% of respondents reported zero days of poor physical health in the past 30 days, and about 64% reported zero days of poor mental health in that same period.
We can also see that there are no NA values in our data set. This is not surprising, given that the data was cleaned before being retrieved.
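The same missing-value check can be run on any data frame without funModeling. A minimal sketch on a toy frame (not our survey data):

```r
# Sketch: per-column NA counts with base R, on a toy frame
toy <- data.frame(a = c(1, 2, NA), b = c("x", "y", "z"))
colSums(is.na(toy))   # per-column NA counts
any(is.na(toy))       # TRUE if anything is missing; FALSE for our heart data
```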
Below are the plots showing the means of our numerical variables. We can see that, on average, people with heart disease have a slightly higher BMI than those without. However, the biggest difference is the number of days a respondent feels physically unwell: respondents with heart disease reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. Poor mental health in a 30-day period is also slightly higher for people with heart disease than for people without.
# Finding the mean of numerical variables grouped by heart disease
plot_num(heart)
heart %>%
group_by(HeartDisease) %>%
summarise_at(vars("BMI",
"PhysicalHealth",
"MentalHealth",
"SleepTime"), mean)
## # A tibble: 2 × 5
## HeartDisease BMI PhysicalHealth MentalHealth SleepTime
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 No 28.2 2.96 3.83 7.09
## 2 Yes 29.4 7.81 4.64 7.14
# Recoding Diabetic: borderline diabetes counts as Yes; gestational-only counts as No
heart$Diabetic <- replace(heart$Diabetic, heart$Diabetic == "No, borderline diabetes", "Yes")
heart$Diabetic[heart$Diabetic == "Yes (during pregnancy)"] <- "No"
# Changing all the categorical variables to numerical dummies
heart$HeartDisease<-ifelse(heart$HeartDisease=="Yes",1,0)
heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)
According to Noel Bairey Merz, MD, director of the Barbra Streisand Women’s Heart Center at the Smidt Heart Institute, women experience heart disease more severely than men: even though men are more likely to suffer heart attacks, women are more likely than men to die from one.
We started off by trying to see which sex tends to report more heart disease. From the table below, we can see that we have far more data for people without heart disease than with it; this imbalance will be addressed later. Even so, the percentage table shows more men with heart disease (about 5%) than women (about 3.5%), in line with our basic theory that men develop more heart disease. This forces us to ask why men have more heart disease in our sample: are they doing more or less of certain activities than their female counterparts? Discussions of the relevant variables are provided below.
contable = table(heart$HeartDisease, heart$Sex)
xkabledply(contable, title="Contingency Table for Heart Disease and Sex")
| HeartDisease | Female | Male |
|---|---|---|
| 0 | 156571 | 135851 |
| 1 | 11234 | 16139 |
addmargins(contable)
##
## Female Male Sum
## 0 156571 135851 292422
## 1 11234 16139 27373
## Sum 167805 151990 319795
prop <-prop.table(contable)
percent <- round(prop*100, 2)
# Contingency table for heart disease and sex, in percentages
print(percent)
##
## Female Male
## 0 48.96 42.48
## 1 3.51 5.05
barplot(with(heart, table(HeartDisease, Sex)), main="Heart Disease Among Male & Female", beside=T, col=3:2)
In our exploratory analysis, we compared the different lifestyle variables between men and women and then further compared those same variables between those with and without heart disease.
hist(heart$BMI,main="Histogram of BMI in Data Set", col='pink', xlab="BMI Levels")
In the histogram, we can see that our data is approximately normal, with some right skew, as is typical of BMI. Given the Central Limit Theorem, with samples this large the sampling distribution of the mean is approximately normal regardless, which justifies the mean-based statistical tests that follow.
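As a quick illustration of why the Central Limit Theorem licenses this assumption, the sketch below draws a right-skewed synthetic variable (a stand-in for BMI, not our survey data) and shows that means of repeated samples are far less skewed than the raw values:

```r
# Sketch (synthetic data): even when a variable is skewed,
# means of large repeated samples are approximately normal.
set.seed(42)
skewed <- rexp(10000, rate = 1 / 28)  # right-skewed stand-in for BMI, mean ~28
sample_means <- replicate(2000, mean(sample(skewed, 100)))
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3
c(raw = skew(skewed), means = skew(sample_means))  # skew shrinks dramatically
```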
After sub-setting the data between male and female, we conducted a Welch’s t-test on the BMIs of both population subsets.
# Subset Male & Female with and without Heart Disease
female <- subset(heart, Sex== "Female")
male <- subset (heart, Sex == "Male")
# T-test of BMI between Genders
t.test(female$BMI, male$BMI)
##
## Welch Two Sample t-test
##
## data: female$BMI and male$BMI
## t = -15, df = 3e+05, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.387 -0.299
## sample estimates:
## mean of x mean of y
## 28.2 28.5
The p-value is less than the 5% significance level, so we reject the null hypothesis and conclude that average BMI differs between the male and female populations. From the sample means, men have a mean BMI of 28.50 while women have 28.16, a difference of 0.34.
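A caveat worth keeping in mind: with n in the hundreds of thousands, even trivial differences reach statistical significance. A sketch of the standardized effect size (Cohen's d) for this BMI gap, where the pooled standard deviation of 6.4 is an illustrative assumption rather than a value computed from the data:

```r
# Sketch: with ~320,000 respondents, tiny mean differences become significant.
m_female  <- 28.16   # group means from the t-test output above
m_male    <- 28.50
sd_pooled <- 6.4     # assumed for illustration, not computed from the data
d <- (m_male - m_female) / sd_pooled
round(d, 2)          # ~0.05: conventionally a negligible-to-small effect
```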
Next, we divided our data frame between men and women who do have heart disease. Once again assuming normality due to the large sample size, we conducted another Welch’s T-test to look at the mean BMI difference among the population with heart issues.
# Subsetting genders based on respondents that have heart disease
female_HD<- subset(female, HeartDisease==1)
male_HD<- subset(male, HeartDisease== 1)
#T-test
t.test(female_HD$BMI, male_HD$BMI)
##
## Welch Two Sample t-test
##
## data: female_HD$BMI and male_HD$BMI
## t = -0.5, df = 20988, p-value = 0.6
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.208 0.121
## sample estimates:
## mean of x mean of y
## 29.4 29.4
From the t-test output above, the p-value is 0.60, which is greater than 0.05. We therefore fail to reject the null hypothesis and conclude that there is no significant difference in average BMI between men and women who have heart disease.
Earlier, we saw that respondents with heart disease reported feeling physically unwell about 8 days per month, while healthy people felt unwell for only 3. The box plot below shows that women with heart disease generally report feeling unwell more often than men.
ggplot(heart, aes(x = factor(HeartDisease), y = PhysicalHealth))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many times in the last 30 days have you felt physically ill?') +
xlab('Heart Disease') +
ylab('Physical Health')+
scale_fill_brewer(palette = 'Set2', name = 'Gender')
t.test(male_HD$PhysicalHealth, female_HD$PhysicalHealth)
##
## Welch Two Sample t-test
##
## data: male_HD$PhysicalHealth and female_HD$PhysicalHealth
## t = -16, df = 23059, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.53 -1.97
## sample estimates:
## mean of x mean of y
## 6.88 9.14
The t-test confirms what the box plot shows: since the p-value is less than 5%, there is a difference in the number of days that men and women feel unwell. On average, women with heart disease reported feeling sick about 9 days per month while men reported just under 7.
ggplot(heart, aes(x = factor(HeartDisease), y = MentalHealth))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle('How many times in the last 30 days have you felt mentally ill?') +
xlab('Heart Disease') +
ylab('Mental Health')+
scale_fill_brewer(palette = 'Set3', name = 'Sex')
t.test (male_HD$MentalHealth,female_HD$MentalHealth)
##
## Welch Two Sample t-test
##
## data: male_HD$MentalHealth and female_HD$MentalHealth
## t = -22, df = 21050, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -2.79 -2.34
## sample estimates:
## mean of x mean of y
## 3.59 6.16
The box plot shows that women tend to report feeling mentally unwell more often than men. Since the p-value is less than 5%, there is a difference in the number of days that men and women feel mentally unwell: on average, women with heart disease reported nearly twice as many days as men (6.16 vs. 3.59). The cyclical relationship between mental health and exercise has been covered extensively (Mikkelsen 2017; Raglin 1990): individuals struggling with mental health issues are often unable to complete physically demanding tasks, which further worsens matters, because exercise produces hormones associated with improvements in mental health.
ggplot(heart, aes(x = factor(HeartDisease), y = SleepTime)) +
geom_boxplot(aes(fill = Sex), outlier.shape = NA) +
ggtitle('On average, how many hours of sleep do you get in a 24-hour period?') +
xlab('Heart Disease') +
ylab('Sleep Time') +
scale_fill_brewer(palette = 'Set2', name = 'Gender')  # 'Set' is not a valid Brewer palette name
t.test(male_HD$SleepTime,female_HD$SleepTime)
##
## Welch Two Sample t-test
##
## data: male_HD$SleepTime and female_HD$SleepTime
## t = 4, df = 22378, p-value = 2e-05
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0515 0.1389
## sample estimates:
## mean of x mean of y
## 7.18 7.08
Visually it looks like men and women get the same amount of sleep. However, the t-test shows that their sleep times are statistically different: men average roughly 40 more minutes of sleep per week than women. This may not sound like much, but sleep effects are known to be cumulative. A study conducted by Robbins et al. in 2021 “examined a group of people who got the recommended seven to eight hours of sleep a night to those who got less (even just 30 minutes less), and exposed them to a norovirus”. The study found that curtailing sleep below the recommended seven hours increases the risk of viral infection, which was “nearly four times higher among those who are sleeping poorly or getting insufficient sleep, compared to those who are getting good rest” (Robbins et al., 2021).
This lack of sleep could explain why women in our data report feeling sick more often on average, and it could also be a contributing factor to why women are more likely than men to die from heart attacks (Noel Bairey Merz).
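The weekly figure is simple arithmetic on the group means from the t-test output:

```r
# Converting the daily sleep gap (hours, from the t-test group means)
# into minutes per week
mean_male   <- 7.18
mean_female <- 7.08
gap_min_per_week <- (mean_male - mean_female) * 60 * 7
round(gap_min_per_week)   # ~42 minutes per week using the rounded means
```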
#Smoking
ggplot(heart, aes(x = factor(HeartDisease), y = Smoking))+
geom_boxplot(aes(fill = Sex),outlier.shape = NA)+
ggtitle(' Have you smoked at least 100 cigarettes in your entire life?') +
xlab('Heart Disease') +
ylab('Smoking')+
scale_fill_brewer(palette = 'Set2', name = 'Gender')
t.test ( male_HD$Smoking,female_HD$Smoking)
##
## Welch Two Sample t-test
##
## data: male_HD$Smoking and female_HD$Smoking
## t = 17, df = 23643, p-value <2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.0941 0.1178
## sample estimates:
## mean of x mean of y
## 0.629 0.523
The proportions of smokers among men and women look similar in the plot. However, among respondents with heart disease, the t-test shows that men smoke at a higher rate than women (63% vs. 52%), and with p < 2e-16 this difference is statistically significant.
ggplot(heart, aes(x = factor(HeartDisease), y = AlcoholDrinking)) +
geom_boxplot(aes(fill = Sex), outlier.shape = NA) +
ggtitle('How many drinks have you had this week?') +
xlab('Heart Disease') +
ylab('Alcohol Consumption')+
scale_fill_brewer(palette = 'Set3', name = 'Gender')  # y was mistakenly Smoking; 'Set4' is not a valid Brewer palette
t.test ( male_HD$AlcoholDrinking,
female_HD$AlcoholDrinking)
##
## Welch Two Sample t-test
##
## data: male_HD$AlcoholDrinking and female_HD$AlcoholDrinking
## t = 3, df = 25196, p-value = 0.01
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.00133 0.01083
## sample estimates:
## mean of x mean of y
## 0.0442 0.0381
Note that in the questionnaire, heavy drinkers are defined as “adult men having more than 14 drinks per week and adult women having more than 7 drinks per week”. Even though the proportions of heavy drinkers are similar for men and women, the definitions differ: the threshold for being labelled a heavy drinker is half as many drinks for a woman as for a man. Therefore, the mean difference shown in the t-test may look small, but on a per-drink basis it is meaningful. Thus, we can conclude that men drink more on a weekly basis than women.
HD <- subset(heart, HeartDisease==1)
freq(HD$AgeCategory)
## var frequency percentage cumulative_perc
## 1 80 or older 5449 19.91 19.9
## 2 70-74 4847 17.71 37.6
## 3 65-69 4101 14.98 52.6
## 4 75-79 4049 14.79 67.4
## 5 60-64 3327 12.15 79.5
## 6 55-59 2202 8.04 87.6
## 7 50-54 1383 5.05 92.6
## 8 45-49 744 2.72 95.4
## 9 40-44 486 1.78 97.1
## 10 35-39 296 1.08 98.2
## 11 30-34 226 0.83 99.0
## 12 25-29 133 0.49 99.5
## 13 18-24 130 0.47 100.0
freq(HD$Race)
## var frequency percentage cumulative_perc
## 1 White 22507 82.22 82.2
## 2 Black 1729 6.32 88.5
## 3 Hispanic 1443 5.27 93.8
## 4 Other 886 3.24 97.0
## 5 American Indian/Alaskan Native 542 1.98 99.0
## 6 Asian 266 0.97 100.0
freq(HD$GenHealth)
## var frequency percentage cumulative_perc
## 1 Good 9558 34.92 34.9
## 2 Fair 7084 25.88 60.8
## 3 Very good 5381 19.66 80.5
## 4 Poor 3850 14.06 94.5
## 5 Excellent 1500 5.48 100.0
Looking at the make-up of respondents who have heart disease, we see that most are 60 or older, which is in line with theory, and that 82% are of Caucasian descent. At the time of the interview, a plurality also rated their overall health as good. The racial skew may be a limiting factor for future research since, once again, the sample has few data points from non-Caucasian individuals.
Even though a logit regression is more appropriate for estimating whether an individual develops heart disease, we used linear regressions (linear probability models) in our exploratory stage to find the best-fitting set of predictors.
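A brief aside on why the linear fits are exploratory only. A minimal sketch on synthetic data (not our survey) comparing a linear probability model to a logit:

```r
# Sketch (synthetic data): a linear probability model (lm on a 0/1 outcome)
# ranks subjects almost identically to a logit, but can predict impossible
# probabilities outside [0, 1].
set.seed(1)
n <- 5000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-2 + 1.5 * x))       # true relationship is logistic
lpm   <- lm(y ~ x)
logit <- glm(y ~ x, family = binomial)
p_lpm   <- predict(lpm)                        # unbounded fitted values
p_logit <- predict(logit, type = "response")   # always inside (0, 1)
sum(p_lpm < 0)                                 # some negative "probabilities"
cor(p_lpm, p_logit, method = "spearman")       # yet the rankings agree
```

This is why the linear models are useful for screening predictors (via BIC and adjusted R-squared) even though the final model is a logit.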
reg_1 <- lm(HeartDisease~Sex+BMI, data=heart)
print('Model One BIC')
## [1] "Model One BIC"
BIC(reg_1)
## [1] 90503
summary (reg_1)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.2521 -0.1024 -0.0805 -0.0587 0.9677
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.03e-03 2.29e-03 2.2 0.028 *
## SexMale 3.85e-02 9.87e-04 39.0 <2e-16 ***
## BMI 2.20e-03 7.76e-05 28.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.279 on 319792 degrees of freedom
## Multiple R-squared: 0.0074, Adjusted R-squared: 0.00739
## F-statistic: 1.19e+03 on 2 and 319792 DF, p-value: <2e-16
reg_2 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + SleepTime, data=heart)
print('Model Two BIC')
## [1] "Model Two BIC"
BIC(reg_2)
## [1] 80963
summary (reg_2)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth +
## SleepTime, data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.3585 -0.0918 -0.0726 -0.0425 1.0001
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.18e-02 3.41e-03 -6.40 1.5e-10 ***
## SexMale 4.22e-02 9.78e-04 43.10 < 2e-16 ***
## BMI 1.43e-03 7.70e-05 18.55 < 2e-16 ***
## PhysicalHealth 6.18e-03 6.41e-05 96.37 < 2e-16 ***
## MentalHealth -4.95e-04 6.44e-05 -7.69 1.5e-14 ***
## SleepTime 3.95e-03 3.41e-04 11.58 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.275 on 319789 degrees of freedom
## Multiple R-squared: 0.0367, Adjusted R-squared: 0.0367
## F-statistic: 2.44e+03 on 5 and 319789 DF, p-value: <2e-16
reg_3 <- lm(HeartDisease~Sex+BMI+ PhysicalHealth+MentalHealth + Smoking + AlcoholDrinking+ SleepTime + AgeCategory + Race, data=heart)
print('Model Three BIC')
## [1] "Model Three BIC"
BIC(reg_3)
## [1] 60289
summary (reg_3)
##
## Call:
## lm(formula = HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth +
## Smoking + AlcoholDrinking + SleepTime + AgeCategory + Race,
## data = heart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4570 -0.1220 -0.0529 -0.0020 1.0648
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.16e-02 5.25e-03 -11.74 < 2e-16 ***
## SexMale 4.88e-02 9.57e-04 50.95 < 2e-16 ***
## BMI 2.05e-03 7.62e-05 26.92 < 2e-16 ***
## PhysicalHealth 4.61e-03 6.32e-05 72.95 < 2e-16 ***
## MentalHealth 9.31e-04 6.39e-05 14.56 < 2e-16 ***
## Smoking 3.50e-02 9.92e-04 35.27 < 2e-16 ***
## AlcoholDrinking -2.25e-02 1.89e-03 -11.93 < 2e-16 ***
## SleepTime -1.39e-03 3.33e-04 -4.16 3.1e-05 ***
## AgeCategory25-29 -6.29e-03 2.75e-03 -2.29 0.0222 *
## AgeCategory30-34 -6.45e-03 2.69e-03 -2.40 0.0166 *
## AgeCategory35-39 -5.83e-03 2.64e-03 -2.21 0.0269 *
## AgeCategory40-44 1.14e-03 2.63e-03 0.43 0.6643
## AgeCategory45-49 1.10e-02 2.61e-03 4.20 2.6e-05 ***
## AgeCategory50-54 2.92e-02 2.52e-03 11.57 < 2e-16 ***
## AgeCategory55-59 4.63e-02 2.44e-03 18.96 < 2e-16 ***
## AgeCategory60-64 6.98e-02 2.39e-03 29.17 < 2e-16 ***
## AgeCategory65-69 9.48e-02 2.39e-03 39.60 < 2e-16 ***
## AgeCategory70-74 1.32e-01 2.44e-03 54.00 < 2e-16 ***
## AgeCategory75-79 1.65e-01 2.65e-03 62.18 < 2e-16 ***
## AgeCategory80 or older 2.08e-01 2.58e-03 80.59 < 2e-16 ***
## RaceAsian -2.45e-02 4.75e-03 -5.15 2.5e-07 ***
## RaceBlack -2.01e-02 4.09e-03 -4.91 9.1e-07 ***
## RaceHispanic -1.75e-02 4.03e-03 -4.35 1.4e-05 ***
## RaceOther -1.30e-02 4.48e-03 -2.90 0.0037 **
## RaceWhite -2.04e-02 3.73e-03 -5.48 4.3e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.266 on 319770 degrees of freedom
## Multiple R-squared: 0.0977, Adjusted R-squared: 0.0976
## F-statistic: 1.44e+03 on 24 and 319770 DF, p-value: <2e-16
With each new addition of variables, the adjusted R-squared increases, indicating that the model fit is improving, although the fit statistic is still quite low. Note that the 40-44 age category has an insignificant p-value. Since model 3 has the highest adjusted R-squared and the lowest BIC, it is our preferred model.
ezids::xkablevif(reg_3, wide=FALSE)
| Variable | VIF |
|---|---|
| AgeCategory25-29 | 1.03 |
| AgeCategory30-34 | 1.06 |
| AgeCategory35-39 | 1.14 |
| AgeCategory40-44 | 1.17 |
| AgeCategory45-49 | 1.08 |
| AgeCategory50-54 | 1.02 |
| AgeCategory55-59 | 1.04 |
| AgeCategory60-64 | 1.72 |
| AgeCategory65-69 | 1.81 |
| AgeCategory70-74 | 1.89 |
| AgeCategory75-79 | 1.92 |
| AgeCategory80 or older | 1.95 |
| AlcoholDrinking | 2.10 |
| BMI | 2.28 |
| MentalHealth | 2.45 |
| PhysicalHealth | 2.47 |
| RaceAsian | 2.37 |
| RaceBlack | 1.99 |
| RaceHispanic | 2.10 |
| RaceOther | 2.51 |
| RaceWhite | 5.04 |
| SexMale | 5.77 |
| SleepTime | 3.00 |
| Smoking | 11.28 |
From the VIF table, the VIF for smoking is 11.28, which exceeds the common threshold of 10 and signals problematic multicollinearity, so smoking will be dropped. Therefore our preferred model is as follows:
HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory
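The VIF values above can be reproduced by hand: for predictor j, regress it on the remaining predictors and compute 1 / (1 - R²). A sketch on synthetic predictors (not our data):

```r
# Sketch (synthetic predictors): manual VIF computation.
# Values above ~10 conventionally flag problematic collinearity.
set.seed(7)
n  <- 1000
x1 <- rnorm(n)
x2 <- 0.9 * x1 + rnorm(n, sd = sqrt(1 - 0.9^2))  # strongly tied to x1
x3 <- rnorm(n)                                    # unrelated to both
vif_manual <- function(target, others) {
  1 / (1 - summary(lm(target ~ ., data = others))$r.squared)
}
c(x1 = vif_manual(x1, data.frame(x2, x3)),  # inflated by x2
  x3 = vif_manual(x3, data.frame(x1, x2)))  # close to 1
```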
We will train and test the model to see the cost analysis of our predictions.
In order to address the unbalanced classification, we subsetted our original data frame to create a new, balanced one (called total in the code) with equal numbers of each HeartDisease value. The new data set has over 54,000 observations, split 80-20 into training and test sets.
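One note on the subsetting below: head() keeps the first 27,373 rows of each class, and since every positive row is kept, only the negative rows depend on row order. A random undersample removes that dependence entirely; a minimal sketch on synthetic labels:

```r
# Sketch (synthetic labels): random undersampling of the majority class.
# Sampling without replacement avoids assuming row order carries no signal.
set.seed(123)
labels <- c(rep(1, 300), rep(0, 3000))            # imbalanced outcome
keep_pos <- which(labels == 1)                    # keep every minority row
keep_neg <- sample(which(labels == 0), length(keep_pos))
balanced <- labels[c(keep_pos, keep_neg)]
table(balanced)                                   # 300 of each class
```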
# Creating a new dataframe (called total) with equal proportions of Heart Disease values
newdata1<- subset(heart, HeartDisease==1)
newdata1_1<-head(newdata1,27373)
nrow(newdata1_1)
## [1] 27373
newdata0<- subset(heart, HeartDisease==0)
newdata0_1<-head(newdata0,27373)
nrow(newdata0_1)
## [1] 27373
total <- rbind(newdata1_1, newdata0_1)
#nrow(total)
#summary(total)
#freq(total$HeartDisease)
# Convert to factor variables
total$HeartDisease<-factor(total$HeartDisease)
total$Sex<-factor(total$Sex)
# Split into 80-20 train-test sets
set.seed(1234)
sample <- sample.int(n = nrow(total), size = floor(.80*nrow(total)), replace = F)
train <- total[sample, ]
test <- total[-sample, ]
# Logit Model Training
model<- glm(HeartDisease ~ Sex + BMI+ PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory, data= train, family='binomial')
#summary(model)
#Chi-squared test for model fit
anova(model, test = 'Chisq')
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: HeartDisease
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 43795 60714
## Sex 1 704 43794 60010 < 2e-16 ***
## BMI 1 510 43793 59499 < 2e-16 ***
## PhysicalHealth 1 2416 43792 57083 < 2e-16 ***
## MentalHealth 1 28 43791 57055 9.4e-08 ***
## AlcoholDrinking 1 124 43790 56931 < 2e-16 ***
## SleepTime 1 10 43789 56921 0.0016 **
## AgeCategory 12 9273 43777 47647 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the ANOVA chi-squared results, every term added to the logit model (HeartDisease ~ Sex + BMI + PhysicalHealth + MentalHealth + AlcoholDrinking + SleepTime + AgeCategory) significantly reduces the deviance, so the model is a good fit for predicting the probability of heart disease. The coefficient estimates are also significant, except for the 25-29 age category, whose odds-ratio confidence interval includes 1.
# Exponential Coefficients & Confidence Interval
exp_coeff<- exp(cbind(OR = coef(model), confint(model)))
knitr::kable(exp_coeff,caption='Odds Ratio & Confidence Interval of Model')
| Term | OR | 2.5 % | 97.5 % |
|---|---|---|---|
| (Intercept) | 0.018 | 0.014 | 0.023 |
| SexMale | 2.106 | 2.014 | 2.203 |
| BMI | 1.041 | 1.037 | 1.044 |
| PhysicalHealth | 1.042 | 1.039 | 1.044 |
| MentalHealth | 1.021 | 1.018 | 1.024 |
| AlcoholDrinking | 0.797 | 0.722 | 0.879 |
| SleepTime | 0.947 | 0.933 | 0.960 |
| AgeCategory25-29 | 0.916 | 0.685 | 1.224 |
| AgeCategory30-34 | 1.602 | 1.246 | 2.068 |
| AgeCategory35-39 | 1.865 | 1.465 | 2.387 |
| AgeCategory40-44 | 3.091 | 2.463 | 3.908 |
| AgeCategory45-49 | 4.059 | 3.265 | 5.089 |
| AgeCategory50-54 | 7.385 | 5.990 | 9.188 |
| AgeCategory55-59 | 10.309 | 8.398 | 12.778 |
| AgeCategory60-64 | 14.405 | 11.766 | 17.810 |
| AgeCategory65-69 | 18.361 | 15.012 | 22.681 |
| AgeCategory70-74 | 27.403 | 22.402 | 33.855 |
| AgeCategory75-79 | 33.746 | 27.513 | 41.797 |
| AgeCategory80 or older | 51.239 | 41.794 | 63.438 |
According to the model, the odds of heart disease are about 2.11 times higher for a man than for a woman, and each one-unit increase in BMI multiplies the odds by about 1.04. Mental health, physical health, drinking, and sleep time also affect the odds but are not the most salient risk factors; they could, however, serve as early indicators. As expected, the odds of heart disease rise steeply with age.
Also note that the confidence intervals for most variables are narrow, indicating that they are estimated precisely. The intervals for age, however, widen at the older age categories. Although the uncertainty is greater at higher ages, there may still be enough precision to inform patients about preventative measures in light of their lifestyle choices.
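To make the odds ratios concrete, the sketch below combines illustrative values from the table above into a predicted probability for a hypothetical profile (male, age 60-64, BMI 30), holding every other predictor at zero for simplicity. This re-derives a ballpark figure only; it is not the fitted model itself:

```r
# Sketch: turning odds ratios back into a predicted probability
b0     <- log(0.018)    # intercept odds from the table
b_male <- log(2.106)    # SexMale
b_age  <- log(14.405)   # AgeCategory60-64
b_bmi  <- log(1.041)    # per one BMI unit
log_odds <- b0 + b_male + b_age + b_bmi * 30
plogis(log_odds)        # log-odds -> probability; roughly 0.65 here
```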
library(ggthemes)
# Prediction
train$prediction <- predict( model, newdata = train, type = "response" )
test$prediction <- predict( model, newdata = test , type = "response" )
# distribution of the prediction score grouped by known outcome
ggplot( train, aes( prediction, color = as.factor(HeartDisease) ) ) +
geom_density( size = 1 ) +
ggtitle( "Training Set's Predicted Score" ) +
scale_color_economist( name = "data", labels = c( "No Heart Disease", "Yes Heart Disease" ) ) +
theme_economist()
# Positive instance = 1 = Yes Heart Disease
# Negative instance = 0 = No Heart Disease
After obtaining the predicted probabilities of heart disease on both the training and testing sets, we drew a double density plot of those predictions. Since we balanced our data evenly, the negative instances are concentrated at low scores while the positive instances are concentrated at high scores; this separation is exactly what we want from a classifier.
# user-defined different cost for false negative and false positive
source('unbalanced_functions.R')
cost.fp = 100
cost.fn = 200
roc_info <- ROCInfo( data = test,
predict = "prediction",
actual = "HeartDisease",
cost.fp = 100,
cost.fn = 200 )
grid.draw(roc_info$plot)
From the plots, we can see that the Area Under the Curve is about 0.79, which is acceptable for our purposes. Specifying different costs for false positives ($100) and false negatives ($200), the total loss comes to just under $382,000.
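The cost figure comes from weighting the two error types differently at a given cutoff: $100 per false positive plus $200 per false negative, which is presumably what the ROCInfo helper evaluates across cutoffs. A sketch of that calculation on synthetic scores:

```r
# Sketch (synthetic scores): asymmetric misclassification cost at one cutoff
set.seed(99)
actual <- rbinom(1000, 1, 0.5)
score  <- ifelse(actual == 1, rbeta(1000, 4, 2), rbeta(1000, 2, 4))
cutoff <- 0.30
pred <- as.integer(score >= cutoff)
fp <- sum(pred == 1 & actual == 0)   # predicted sick, actually healthy
fn <- sum(pred == 0 & actual == 1)   # predicted healthy, actually sick
c(fp = fp, fn = fn, total_cost = 100 * fp + 200 * fn)
```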
loadPkg("regclass")
xkabledply( confusion_matrix(model), title = "Confusion Matrix from Logit Model" )
|  | Predicted 0 | Predicted 1 | Total |
|---|---|---|---|
| Actual 0 | 14653 | 7276 | 21929 |
| Actual 1 | 4691 | 17176 | 21867 |
| Total | 19344 | 24452 | 43796 |
#unloadPkg("regclass")
#
cm_info <- ConfusionMatrixInfo( data = test,
predict = "prediction",
actual = 'HeartDisease',
cutoff = .30 )
#ggthemr("flat")
plot(cm_info$plot)
Calculated Values for Model:
Accuracy = 0.727, Misclassification = 0.273, Sensitivity = 0.78, Specificity = 0.66
Using the optimal cut-off of 0.30 obtained from the ROC/cost function, we see that our model predicts more positives than negatives. Its accuracy is about 73%, which would not be good enough for clinical use in the healthcare field but can still serve as a public early-detection tool to raise individual awareness. The model shows high sensitivity and low specificity: it is good at catching actual cases of heart disease but has a high false-positive rate. In the grand scheme of things, a false positive is less harmful than a false negative. For instance, if individual A is told they have a higher likelihood of future heart problems and this turns out to be a false positive, A simply ends up living a bit more healthily than before. It is the false negatives that we will need to reduce in the future.
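These metrics follow directly from the 2x2 confusion matrix at the chosen cutoff. A sketch with illustrative counts (chosen to mirror the reported values, not the actual test-set counts):

```r
# Sketch: accuracy, sensitivity, and specificity from a 2x2 confusion matrix
tn <- 660; fp <- 340   # actual negatives: correctly cleared / false alarms
fn <- 220; tp <- 780   # actual positives: missed / caught
accuracy    <- (tp + tn) / (tp + tn + fp + fn)
sensitivity <- tp / (tp + fn)   # share of actual positives caught
specificity <- tn / (tn + fp)   # share of actual negatives correctly cleared
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
```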
In this part, we identify whether heart disease is related to secondary illnesses and to health conditions such as BMI and difficulty walking.
heart_dum <- read.csv("heart_2020_cleaned.csv")
# Recode on the freshly loaded frame (heart$Diabetic is already 0/1 at this point)
heart_dum$Diabetic <- replace(heart_dum$Diabetic, heart_dum$Diabetic == "No, borderline diabetes", "Yes")
heart_dum$Diabetic[heart_dum$Diabetic == "Yes (during pregnancy)"] <- "No"
#Exploratory data analysis
library(ggplot2)
ggplot(heart, aes(HeartDisease)) +
geom_bar(color="black",fill="aquamarine4")
ggplot(data = heart_dum) +
geom_bar(mapping = aes(x = Stroke, fill = HeartDisease), position = "fill")+scale_fill_manual(values = c("Yes" = "black", "No" = "orange"))+ggtitle("HeartDisease vs Stroke")
ggplot(data = heart_dum) +
geom_bar(mapping = aes(x = Diabetic, fill = HeartDisease), position = "fill")+scale_fill_manual(values = c("Yes" = "black", "No" = "orange"))+ggtitle("HeartDisease vs Diabetic")
ggplot(data = heart_dum) +
geom_bar(mapping = aes(x = KidneyDisease, fill = HeartDisease), position = "fill")+scale_fill_manual(values = c("Yes" = "black", "No" = "orange"))+ggtitle("HeartDisease vs KidneyDisease")
ggplot(data = heart_dum) +
geom_bar(mapping = aes(x = SkinCancer, fill = HeartDisease), position = "fill")+scale_fill_manual(values = c("Yes" = "black", "No" = "orange"))+ggtitle("HeartDisease vs SkinCancer")
ggplot(data = heart_dum) +
geom_bar(mapping = aes(x = Asthma, fill = HeartDisease), position = "fill")+scale_fill_manual(values = c("Yes" = "black", "No" = "orange"))+ggtitle("HeartDisease vs Asthma")
ggplot(data = heart_dum) +
geom_bar(mapping = aes(x = DiffWalking, fill = HeartDisease), position = "fill")+scale_fill_manual(values = c("Yes" = "black", "No" = "orange"))+ggtitle("HeartDisease vs Difficulty in walking")
# We separate our variables of interest into a smaller data frame so they are easier to test.
causes <- data.frame (heart$HeartDisease,
heart$Stroke,
heart$Diabetic,
heart$Asthma,
heart$KidneyDisease,
heart$SkinCancer,
heart$DiffWalking)
# Rename columns
names(causes) <- c ('heart_disease',
'stroke',
'diabetic',
'asthma',
'kidney_disease',
'skin_cancer',
'Difficult_walking')
cor_matrix<-cor(causes)
cor_matrix
## heart_disease stroke diabetic asthma kidney_disease
## heart_disease 1.0000 0.1968 0.1782 0.041444 0.1452
## stroke 0.1968 1.0000 0.1062 0.038866 0.0912
## diabetic 0.1782 0.1062 1.0000 0.048559 0.1466
## asthma 0.0414 0.0389 0.0486 1.000000 0.0397
## kidney_disease 0.1452 0.0912 0.1466 0.039707 1.0000
## skin_cancer 0.0933 0.0481 0.0377 -0.000396 0.0618
## Difficult_walking 0.2013 0.1741 0.2160 0.103222 0.1531
## skin_cancer Difficult_walking
## heart_disease 0.093317 0.2013
## stroke 0.048116 0.1741
## diabetic 0.037747 0.2160
## asthma -0.000396 0.1032
## kidney_disease 0.061816 0.1531
## skin_cancer 1.000000 0.0648
## Difficult_walking 0.064840 1.0000
library(ggcorrplot)  # for ggcorrplot(); the %>% pipe comes from magrittr/tidyverse
model.matrix(~0+., data=causes) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag = F, type="upper", lab=TRUE, lab_size=2)
From the correlation matrix above we can see that, overall, the correlations are not particularly strong. Heart disease is more correlated with stroke than with any other secondary illness, which is not surprising. Diabetes is correlated with heart disease and kidney disease, and less so with stroke. Skin cancer shows negligible correlation with every other disease. Overall, only stroke, diabetes, and kidney disease show appreciable correlation with heart disease.
This agrees with the Centers for Disease Control and Prevention: common heart disorders can increase the risk of stroke. For example, coronary artery disease raises stroke risk because plaque builds up in the arteries and blocks the flow of oxygen-rich blood to the brain. Diabetes also increases stroke risk (see stroke vs. diabetic in our correlation plot): diabetes causes sugars to build up in the blood, preventing oxygen and nutrients from reaching various parts of the body, including the brain. High blood pressure is also common in people with diabetes; it is the leading cause of stroke and the main driver of the elevated stroke risk among people with diabetes (CDC Stroke Fact Sheet, https://www.cdc.gov/stroke/conditions.htm).
After examining the correlations between variables, we now perform a chi-square test of independence to determine whether the variables are dependent on or independent of each other. The chi-square test of independence is used to determine whether there is a significant relationship between two nominal (categorical) variables; all the variables here are already categorical. More generally, it tests whether the distributions of the categorical variables differ from one another: the frequency of each category of one variable is compared across the categories of the second variable, and the data can be displayed in a contingency table where each row represents a category of one variable and each column a category of the other. The null hypothesis for each test is that there is no relationship between the primary disease (HeartDisease) and the given secondary illness; the alternative hypothesis is that there is a relationship.
Each test reports its title, the variables used, the test statistic, the degrees of freedom, and the p-value.
hea_as=table(heart$HeartDisease,heart$Asthma)
chi_1=chisq.test(hea_as)
chi_1
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hea_as
## X-squared = 549, df = 1, p-value <2e-16
hea_dia=table(heart$HeartDisease,heart$Diabetic)
chi_2=chisq.test(hea_dia)
chi_2
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hea_dia
## X-squared = 10151, df = 1, p-value <2e-16
hea_kid=table(heart$HeartDisease,heart$KidneyDisease)
chi_3=chisq.test(hea_kid)
chi_3
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hea_kid
## X-squared = 6739, df = 1, p-value <2e-16
hea_skin=table(heart$HeartDisease,heart$SkinCancer)
chi_4=chisq.test(hea_skin)
chi_4
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hea_skin
## X-squared = 2784, df = 1, p-value <2e-16
hea_stroke=table(heart$HeartDisease,heart$Stroke)
chi_5=chisq.test(hea_stroke)
chi_5
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hea_stroke
## X-squared = 12386, df = 1, p-value <2e-16
hea_walk=table(heart$HeartDisease,heart$DiffWalking)
chi_6=chisq.test(hea_walk)
chi_6
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: hea_walk
## X-squared = 12951, df = 1, p-value <2e-16
From the output (and from `test$p.value`) we see that every p-value is below the 5% significance level. As with any statistical test, when the p-value is less than the significance level we reject the null hypothesis: each secondary illness or condition tested here (stroke, diabetes, asthma, kidney disease, skin cancer, and difficulty walking) has a statistically significant relationship with heart disease.
library(tidyverse)
library(caret)
heartsub <- data.frame(heart)
heartsub <- heartsub[, c("HeartDisease","Stroke","Diabetic","Asthma","KidneyDisease","SkinCancer","DiffWalking","BMI")]
newdata1<- subset(heartsub, HeartDisease==1)
newdata1_1<-head(newdata1,20000)
newdata0<- subset(heartsub, HeartDisease==0)
newdata0_1<-head(newdata0,20000)
lokesh <- rbind(newdata1_1, newdata0_1)
set.seed(100)
heart_train_rows <- sample(1:nrow(lokesh),
0.7*nrow(lokesh))
heart_train <- lokesh[heart_train_rows, ]
heart_test <- lokesh[-heart_train_rows, ]
mod1 <- glm(HeartDisease ~ Stroke + Diabetic + Asthma + KidneyDisease
+ SkinCancer+DiffWalking+BMI,
data=heart_train,family = "binomial")
summary(mod1)
##
## Call:
## glm(formula = HeartDisease ~ Stroke + Diabetic + Asthma + KidneyDisease +
## SkinCancer + DiffWalking + BMI, family = "binomial", data = heart_train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.685 -0.893 -0.886 1.082 1.502
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.75273 0.06081 -12.38 <2e-16 ***
## Stroke 1.49484 0.05690 26.27 <2e-16 ***
## Diabetic 0.95070 0.03285 28.94 <2e-16 ***
## Asthma 0.04605 0.03693 1.25 0.21
## KidneyDisease 0.87482 0.05707 15.33 <2e-16 ***
## SkinCancer 0.61962 0.03806 16.28 <2e-16 ***
## DiffWalking 0.92016 0.03267 28.16 <2e-16 ***
## BMI 0.00118 0.00211 0.56 0.58
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 38816 on 27999 degrees of freedom
## Residual deviance: 34155 on 27992 degrees of freedom
## AIC: 34171
##
## Number of Fisher Scoring iterations: 4
expcoeff = exp(coef(mod1))
# expcoeff
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
| x | |
|---|---|
| (Intercept) | 0.471 |
| Stroke | 4.459 |
| Diabetic | 2.587 |
| Asthma | 1.047 |
| KidneyDisease | 2.398 |
| SkinCancer | 1.858 |
| DiffWalking | 2.510 |
| BMI | 1.001 |
#loadPkg("regclass")
# confusion_matrix
xkabledply( confusion_matrix(mod1), title = "Confusion matrix from Logit Model" )
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 10343 | 3724 | 14067 |
| Actual 1 | 5246 | 8687 | 13933 |
| Total | 15589 | 12411 | 28000 |
#unloadPkg("regclass")
Accuracy - It determines the overall predicted accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives)/(True Positives + True Negatives + False Positives + False Negatives)
loadPkg("pROC")
library(pROC)
library(ROCR)
test_prob <- predict(mod1, newdata = heart_test, type = "response")
train_prob <-predict(mod1, newdata = heart_train, type = "response")
test_roc <- roc(heart_test$HeartDisease ~ test_prob, plot = TRUE, print.auc = TRUE)
fittedheartTrain=ifelse(train_prob<0.5,0,1 )
ConfMatTrain=table(heart_train$HeartDisease,fittedheartTrain)
TrainAccuracy=(ConfMatTrain[1,1]+ConfMatTrain[2,2])/sum(ConfMatTrain)
TrainTPR=72/(72+11)  # hand-entered counts (not computed from ConfMatTrain)
TrainFPR=10/(10+79)
print(cbind(TrainAccuracy, TrainTPR, TrainFPR))
## TrainAccuracy TrainTPR TrainFPR
## [1,] 0.68 0.867 0.112
fittedheartTest=ifelse(test_prob<0.5,0,1 )
ConfMatTest=table(heart_test$HeartDisease,fittedheartTest)
TestAccuracy=(ConfMatTest[1,1]+ConfMatTest[2,2])/sum(ConfMatTest)
TestTPR=45/(45+10)  # hand-entered counts (not computed from ConfMatTest)
TestFPR=1-52/(52+7) # expressed as 1 - true negative rate
print(cbind(TestAccuracy, TestTPR, TestFPR))
## TestAccuracy TestTPR TestFPR
## [1,] 0.687 0.818 0.119
Our problem statement is: does smoking or drinking increase the risk of heart disease, and does having diabetes increase the risk of heart disease among smokers or drinkers?
We performed some basic EDA by plotting graphs of the dataset. After this preliminary analysis, we make the following observations:
The HeartDisease variable is heavily skewed: the percentage of people without heart disease is far higher than the percentage with it. Since HeartDisease is the target variable, this imbalance could affect the model's predictions.
To fix this we built a subset of the dataset with equal proportions of yes and no values of the HeartDisease variable.
##### Graphs
From the graph of heart disease against smoking, people who smoke show a higher tendency toward heart disease, which matches real-life expectations.
From the graph of heart disease against drinking, the proportion of drinkers among people with heart disease is lower than among people without it. This is contrary to the common assumption that people who drink are more prone to heart disease.
From the graph for diabetes, people with diabetes are more prone to heart disease.
From the graph of diabetes against smoking, the chance of having diabetes is higher among smokers.
From the graph of diabetes against alcohol drinking, relatively few people with diabetes drink.
We observed the following trends between heart disease and smoking, alcohol drinking, and diabetes in the preliminary analysis:
The chance of heart disease is higher for people who smoke than for people who drink. People with diabetes are more prone to heart disease, and diabetes follows the same pattern as heart disease: smokers are more likely to have diabetes than drinkers.
After the graphs we did some statistical analysis of the variables. The correlation coefficients between heart disease and smoking, drinking, and diabetes are 0.184, -0.0609, and 0.266 respectively: heart disease is most strongly correlated with diabetes, while drinking is the least correlated and negatively so. The correlation coefficients between diabetes and smoking and between diabetes and alcohol drinking are 0.0733 and -0.0705 respectively. Diabetes and smoking are only lightly correlated (0.073), so smoking and drinking do not affect diabetes to any great extent.
From the contingency tables, approximately 29% of the sample both smoke and have heart disease, while only about 21% have heart disease without smoking. Only about 2% of the sample are drinkers with heart disease, versus about 3.5% who are drinkers without it; this is the source of the negative correlation between heart disease and drinking. About 18% of the sample are heart patients with diabetes, while only about 6% have diabetes without heart disease.
We built models to examine the statistical association between heart disease and smoking, alcohol drinking, and diabetes, and to predict the chance of heart disease from these variables. We also checked whether the predicted chance of heart disease rises or falls when diabetes is taken into account among smokers and drinkers.
Because this is a classification problem, we used logistic regression and built two models, model_lr and model_lr2.
model_lr has smoking and alcohol drinking as independent variables and heart disease as the dependent variable, with an interaction between smoking and alcohol drinking. model_lr2 adds diabetes as an independent variable, with interactions between diabetes and alcohol drinking and between diabetes and smoking.
From the p-values, both smoking and alcohol drinking are statistically significant. The coefficient for smoking is 0.7891, meaning smokers have log-odds of heart disease 0.7891 higher than non-smokers; the coefficient for alcohol drinking is -0.7844, a corresponding decrease in log-odds for drinkers. The large z-value of 39.31 for smoking indicates a strong association between smoking and heart disease. The high p-value on the smoking:alcohol-drinking interaction indicates no statistically significant interaction between the two variables.
We obtained an accuracy of 58.9% for model_lr, which is not good given that the dataset has an equal distribution of heart disease cases, making 50% the chance baseline.
The odds ratio is the ratio of the odds of having heart disease to the odds of not having it. For smokers, the odds of heart disease are 2.2 times those of non-smokers; for drinkers, the odds are 0.45 times those of non-drinkers.
From the confusion matrix we can see that the model produced 12157 true positives but also 7867 false positives. Despite an accuracy close to 60%, the false positive count is very high relative to the true positives, so the model is not good.
From the ROC curve we can conclude that performance is poor, as the curve lies close to the diagonal.
In model_lr2, all three main effects (smoking, drinking, diabetes) are statistically significant. The interaction of diabetes with smoking is statistically significant, but there is no strong interaction between diabetes and alcohol drinking among heart disease patients.
The accuracy of model_lr2 increased when diabetes was added, which shows the significance of diabetes for predicting heart disease.
For model_lr2, the exponentiated smoking:diabetes interaction coefficient is 0.87, meaning the joint effect of smoking and diabetes on the odds of heart disease is slightly less than the product of their individual effects.
From the confusion matrix, model_lr2 has 15421 true positives and 9483 false positives, slightly better than the first model, but the accuracy still cannot be called good given the large number of misclassifications.
The ROC curve is slightly better than that of model_lr.
To conclude, model_lr2 performs slightly better than model_lr, which points to a strong association between diabetes and heart disease, which is in turn associated with smoking. The associations the model finds between heart disease and smoking and diabetes match real-life expectations. Drinking is not positively related to heart disease in the model, which contradicts real-life expectations; this stems from the fact that there are very few true positive observations among alcohol drinkers. Accuracies of 58% and 63% are only modest, and the number of false positives is too high, which makes the models unreliable.
barplot(table(total$Smoking, total$HeartDisease), legend= rownames(table(total$Smoking, total$HeartDisease)), col =(c("orange", "blue")), xlab="heartdisease", ylab="smoking",main = "Barplot for Heartdisease and Smoking")
barplot(table(total$AlcoholDrinking, total$HeartDisease), legend= rownames(table(total$AlcoholDrinking, total$HeartDisease)), col =(c("red", "darkblue")), xlab="Heartdisease", ylab="Alcoholdrinking", main = "Barplot for Heartdisease and Alcoholdrinking")
barplot(table(total$Diabetic, total$HeartDisease), legend= rownames(table(total$Diabetic, total$HeartDisease)), col =(c("green", "brown")),xlab="HeartDisease",ylab="Diabetics", main = "Barplot for Heartdisease and Diabetic")
barplot(table(total$Diabetic, total$Smoking), legend= rownames(table(total$Diabetic, total$Smoking)), col =(c("red", "green")),xlab="Smoking",ylab="Diabetics", main = "Barplot for Smoking and Diabetic")
barplot(table(total$Diabetic, total$AlcoholDrinking), legend= rownames(table(total$Diabetic, total$AlcoholDrinking)), col =(c("red", "blue")),xlab="AlcoholDrinking",ylab="Diabetics", main = "Barplot for AlcoholDrinking and Diabetic")
total$HeartDisease <- as.numeric(total$HeartDisease)
cor_as_hd <- cor(total$HeartDisease, total$AlcoholDrinking)
cor_as_hd
## [1] -0.0609
cor_hd_smoke <- cor(total$HeartDisease, total$Smoking)
cor_hd_smoke
## [1] 0.184
cor_hd_dia <- cor(total$HeartDisease, total$Diabetic)
cor_hd_dia
## [1] 0.266
cor_dia_smoke <- cor(total$Diabetic, total$Smoking)
cor_dia_smoke
## [1] 0.0733
cor_dia_drink <- cor(total$Diabetic, total$AlcoholDrinking)
cor_dia_drink
## [1] -0.0705
contable <- table(total$HeartDisease, total$Smoking)
prop <-prop.table(contable)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and Smoking in Percentage")
| 0 | 1 |
|---|---|
| 29.9 | 20.1 |
| 20.7 | 29.3 |
contable2 <- table(total$HeartDisease, total$AlcoholDrinking)
prop <-prop.table(contable2)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and AlcoholDrinking in Percentage")
| 0 | 1 |
|---|---|
| 46.5 | 3.48 |
| 47.9 | 2.08 |
contable3 <- table(total$HeartDisease, total$Diabetic)
prop <-prop.table(contable3)
percent <- round(prop*100, 2)
xkabledply(percent, title="Contingency Table for heart Disease and Diabetics in Percentage")
| 0 | 1 |
|---|---|
| 43.6 | 6.39 |
| 32.2 | 17.80 |
model_lr <- glm(HeartDisease ~ Smoking + AlcoholDrinking + AlcoholDrinking:Smoking, data = train , family = "binomial")
summary(model_lr)
##
## Call:
## glm(formula = HeartDisease ~ Smoking + AlcoholDrinking + AlcoholDrinking:Smoking,
## family = "binomial", data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.367 -1.031 -0.746 0.999 1.683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.3538 0.0139 -25.44 <2e-16 ***
## Smoking 0.7891 0.0201 39.31 <2e-16 ***
## AlcoholDrinking -0.7844 0.0828 -9.47 <2e-16 ***
## Smoking:AlcoholDrinking 0.1050 0.0978 1.07 0.28
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 60714 on 43795 degrees of freedom
## Residual deviance: 58880 on 43792 degrees of freedom
## AIC: 58888
##
## Number of Fisher Scoring iterations: 4
xkabledply(model_lr, title = paste("Logistic Regression :", format(formula(model_lr))))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -0.354 | 0.0139 | -25.44 | 0.000 |
| Smoking | 0.789 | 0.0201 | 39.31 | 0.000 |
| AlcoholDrinking | -0.784 | 0.0828 | -9.47 | 0.000 |
| Smoking:AlcoholDrinking | 0.105 | 0.0978 | 1.07 | 0.283 |
fitted.results <- predict(model_lr,newdata = test, type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$HeartDisease)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.589680365296804"
expcoeff = exp(coef(model_lr))
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
| x | |
|---|---|
| (Intercept) | 0.702 |
| Smoking | 2.201 |
| AlcoholDrinking | 0.456 |
| Smoking:AlcoholDrinking | 1.111 |
knitr::kable(expcoeff,caption='Odds Ratio & Confidence Interval of Model')
| x | |
|---|---|
| (Intercept) | 0.702 |
| Smoking | 2.201 |
| AlcoholDrinking | 0.456 |
| Smoking:AlcoholDrinking | 1.111 |
xkabledply(confint(model_lr), title = "CIs using profiled log-likelihood" )
xkabledply(confint.default(model_lr), title = "CIs using standard errors" )
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -0.3811 | -0.327 |
| Smoking | 0.7497 | 0.828 |
| AlcoholDrinking | -0.9490 | -0.624 |
| Smoking:AlcoholDrinking | -0.0852 | 0.298 |
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -0.3811 | -0.327 |
| Smoking | 0.7497 | 0.828 |
| AlcoholDrinking | -0.9468 | -0.622 |
| Smoking:AlcoholDrinking | -0.0866 | 0.297 |
new_heart <- subset(heart,select = c(HeartDisease, Smoking, AlcoholDrinking, Diabetic))
cor_matrix<-cor(new_heart)
cor_matrix
## HeartDisease Smoking AlcoholDrinking Diabetic
## HeartDisease 1.0000 0.1078 -0.0321 0.1782
## Smoking 0.1078 1.0000 0.1118 0.0577
## AlcoholDrinking -0.0321 0.1118 1.0000 -0.0579
## Diabetic 0.1782 0.0577 -0.0579 1.0000
model.matrix(~0+., data=new_heart) %>%
cor(use="pairwise.complete.obs") %>%
ggcorrplot(show.diag = F, type="upper", lab=TRUE, lab_size=3)
loadPkg("regclass")
xkabledply(confusion_matrix(model_lr), title = "Confusion matrix from Logit Model" )
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 14062 | 7867 | 21929 |
| Actual 1 | 9710 | 12157 | 21867 |
| Total | 23772 | 20024 | 43796 |
loadPkg("pROC")
library(ROCR)
p <- predict(model_lr, newdata=test, type="response")
pr <- prediction(p, test$HeartDisease)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
## [1] 0.593
#area of curve is 0.593
model_lr2 <- glm(HeartDisease ~ Smoking + AlcoholDrinking + Diabetic + Diabetic:Smoking + Diabetic:AlcoholDrinking, data = train , family = "binomial")
summary(model_lr2)
##
## Call:
## glm(formula = HeartDisease ~ Smoking + AlcoholDrinking + Diabetic +
## Diabetic:Smoking + Diabetic:AlcoholDrinking, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.779 -0.914 -0.718 1.123 1.721
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.65649 0.01610 -40.77 <2e-16 ***
## Smoking 0.78652 0.02281 34.49 <2e-16 ***
## AlcoholDrinking -0.56685 0.04808 -11.79 <2e-16 ***
## Diabetic 1.36178 0.03511 38.79 <2e-16 ***
## Smoking:Diabetic -0.13908 0.05019 -2.77 0.0056 **
## AlcoholDrinking:Diabetic 0.00964 0.13462 0.07 0.9429
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 60714 on 43795 degrees of freedom
## Residual deviance: 55946 on 43790 degrees of freedom
## AIC: 55958
##
## Number of Fisher Scoring iterations: 4
xkabledply(model_lr2, title = paste("Logistic Regression :", format(formula(model_lr2))))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -0.6565 | 0.0161 | -40.7697 | 0.0000 |
| Smoking | 0.7865 | 0.0228 | 34.4882 | 0.0000 |
| AlcoholDrinking | -0.5669 | 0.0481 | -11.7909 | 0.0000 |
| Diabetic | 1.3618 | 0.0351 | 38.7872 | 0.0000 |
| Smoking:Diabetic | -0.1391 | 0.0502 | -2.7708 | 0.0056 |
| AlcoholDrinking:Diabetic | 0.0096 | 0.1346 | 0.0716 | 0.9429 |
fitted.results <- predict(model_lr2,newdata = test, type='response')
fitted.results <- ifelse(fitted.results > 0.5,1,0)
misClasificError <- mean(fitted.results != test$HeartDisease)
print(paste('Accuracy',1-misClasificError))
## [1] "Accuracy 0.629041095890411"
expcoeff2 = exp(coef(model_lr2))
xkabledply( as.table(expcoeff2), title = "Exponential of coefficients in Logit Reg" )
| x | |
|---|---|
| (Intercept) | 0.519 |
| Smoking | 2.196 |
| AlcoholDrinking | 0.567 |
| Diabetic | 3.903 |
| Smoking:Diabetic | 0.870 |
| AlcoholDrinking:Diabetic | 1.010 |
knitr::kable(expcoeff2,caption='Odds Ratio & Confidence Interval of Model')
| x | |
|---|---|
| (Intercept) | 0.519 |
| Smoking | 2.196 |
| AlcoholDrinking | 0.567 |
| Diabetic | 3.903 |
| Smoking:Diabetic | 0.870 |
| AlcoholDrinking:Diabetic | 1.010 |
xkabledply(confint(model_lr2), title = "CIs using profiled log-likelihood" )
xkabledply(confint.default(model_lr2), title = "CIs using standard errors" )
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -0.688 | -0.6250 |
| Smoking | 0.742 | 0.8313 |
| AlcoholDrinking | -0.661 | -0.4730 |
| Diabetic | 1.293 | 1.4308 |
| Smoking:Diabetic | -0.237 | -0.0406 |
| AlcoholDrinking:Diabetic | -0.252 | 0.2761 |
| 2.5 % | 97.5 % | |
|---|---|---|
| (Intercept) | -0.688 | -0.6249 |
| Smoking | 0.742 | 0.8312 |
| AlcoholDrinking | -0.661 | -0.4726 |
| Diabetic | 1.293 | 1.4306 |
| Smoking:Diabetic | -0.237 | -0.0407 |
| AlcoholDrinking:Diabetic | -0.254 | 0.2735 |
#loadPkg("regclass")
xkabledply( confusion_matrix(model_lr2), title = "Confusion matrix from Logit Model" )
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 12446 | 9483 | 21929 |
| Actual 1 | 6446 | 15421 | 21867 |
| Total | 18892 | 24904 | 43796 |
loadPkg("pROC")
library(ROCR)
p1 <- predict(model_lr2, newdata=test, type="response")
pr1 <- prediction(p1, test$HeartDisease)
prf1 <- performance(pr1, measure = "tpr", x.measure = "fpr")
plot(prf1)
auc2 <- performance(pr1, measure = "auc")  # use pr1 (model_lr2); the original reused pr from model_lr
auc2 <- auc2@y.values[[1]]
auc2
## (the original code reused `pr` here, so the 0.593 it printed is model_lr's AUC, not model_lr2's)
#plot_num(heart)
heart$GenHealth<-as.factor(heart$GenHealth)
heart$GeneralHealth<-sapply(heart$GenHealth,unclass)
#5-Very good,4-poor,3-Good,2-fair,1-excellent
GenHealth is a categorical variable with levels Excellent, Fair, Good, Poor, and Very good. For our simple analysis we converted it to numeric codes via `unclass()`, which follows R's alphabetical factor ordering: Excellent = 1, Fair = 2, Good = 3, Poor = 4, Very good = 5. Note that this coding is not ordered by health (Poor codes higher than Good), which matters when interpreting correlations.
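Because the default alphabetical coding is not ordered by actual health, correlations against it are hard to interpret. A sketch of an alternative coding from worst to best; the `GeneralHealthOrd` column is our own addition, not part of the original analysis:

```r
# Recode GenHealth so larger numbers always mean better self-reported health.
# Assumes heart$GenHealth holds the five level labels used in this dataset.
health_levels <- c("Poor", "Fair", "Good", "Very good", "Excellent")
heart$GeneralHealthOrd <- as.integer(
  factor(heart$GenHealth, levels = health_levels, ordered = TRUE)
)
# Now Poor = 1 ... Excellent = 5, so a negative correlation with HeartDisease
# would genuinely mean "worse health goes with more heart disease".
table(heart$GenHealth, heart$GeneralHealthOrd)
```

With this coding, the sign of `cor(heart$HeartDisease, heart$GeneralHealthOrd)` becomes directly interpretable.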
freq(heart)
## Sex frequency percentage cumulative_perc
## 1 Female 167805 52.5 52.5
## 2 Male 151990 47.5 100.0
## AgeCategory frequency percentage cumulative_perc
## 1 65-69 34151 10.68 10.7
## 2 60-64 33686 10.53 21.2
## 3 70-74 31065 9.71 30.9
## 4 55-59 29757 9.31 40.2
## 5 50-54 25382 7.94 48.2
## 6 80 or older 24153 7.55 55.7
## 7 45-49 21791 6.81 62.5
## 8 75-79 21482 6.72 69.2
## 9 18-24 21064 6.59 75.8
## 10 40-44 21006 6.57 82.4
## 11 35-39 20550 6.43 88.8
## 12 30-34 18753 5.86 94.7
## 13 25-29 16955 5.30 100.0
## Race frequency percentage cumulative_perc
## 1 White 245212 76.68 76.7
## 2 Hispanic 27446 8.58 85.3
## 3 Black 22939 7.17 92.4
## 4 Other 10928 3.42 95.9
## 5 Asian 8068 2.52 98.4
## 6 American Indian/Alaskan Native 5202 1.63 100.0
## GenHealth frequency percentage cumulative_perc
## 1 Very good 113858 35.60 35.6
## 2 Good 93129 29.12 64.7
## 3 Excellent 66842 20.90 85.6
## 4 Fair 34677 10.84 96.5
## 5 Poor 11289 3.53 100.0
## [1] "Variables processed: Sex, AgeCategory, Race, GenHealth"
heart$Diabetic<- replace (heart$Diabetic, heart$Diabetic== "No, borderline diabetes" , "Yes")
heart$Diabetic[heart$Diabetic== "Yes (during pregnancy)"] <- "No"
#Checking if diabetic has been changed
describe (heart$Diabetic)
## heart$Diabetic
## n missing distinct
## 319795 0 2
##
## Value 0 1
## Frequency 272212 47583
## Proportion 0.851 0.149
freq(heart$Diabetic)
## var frequency percentage cumulative_perc
## 1 0 272212 85.1 85.1
## 2 1 47583 14.9 100.0
heart$Smoking<-ifelse(heart$Smoking=="Yes",1,0)
heart$AlcoholDrinking<-ifelse(heart$AlcoholDrinking=="Yes",1,0)
heart$Stroke<-ifelse(heart$Stroke=="Yes",1,0)
heart$DiffWalking<-ifelse(heart$DiffWalking=="Yes",1,0)
heart$PhysicalActivity<-ifelse(heart$PhysicalActivity=="Yes",1,0)
heart$KidneyDisease<-ifelse(heart$KidneyDisease=="Yes",1,0)
heart$SkinCancer<-ifelse(heart$SkinCancer=="Yes",1,0)
heart$Diabetic<-ifelse(heart$Diabetic=="Yes",1,0)
heart$Asthma<-ifelse(heart$Asthma=="Yes",1,0)
heart$Race<-as.factor(heart$Race)
heart$Ethnicity<-sapply(heart$Race,unclass)
Race is a categorical variable covering all the races in the dataset. We converted it into numeric codes for EDA using R's alphabetical factor ordering: American Indian/Alaskan Native = 1, Asian = 2, Black = 3, Hispanic = 4, Other = 5, White = 6.
freq(heart$Ethnicity)
## var frequency percentage cumulative_perc
## 1 6 245212 76.68 76.7
## 2 4 27446 8.58 85.3
## 3 3 22939 7.17 92.4
## 4 5 10928 3.42 95.9
## 5 2 8068 2.52 98.4
## 6 1 5202 1.63 100.0
In our dataset most people are White, followed by Hispanic and Black respondents, then Other and Asian, with American Indian/Alaskan Native the smallest group.
loadPkg("corrplot")
cor(heart$HeartDisease,heart$GeneralHealth)
## [1] -0.0111
The correlation between heart disease and general health here is only -0.011, which is negligible (as a rule of thumb, even coefficients between 0.25 and 0.5 are considered weak). Note also that the numeric coding of general health is alphabetical rather than ordered by health (Poor = 4 codes above Good = 3), so the sign of this coefficient should not be read as "heart disease rises as general health worsens" without first recoding the levels in order.
contable = table(heart$GeneralHealth, heart$HeartDisease)
xkabledply(contable, title="Contingency Table for Heart Disease and GeneralHealth")
| 0 | 1 |
|---|---|
| 65342 | 1500 |
| 27593 | 7084 |
| 83571 | 9558 |
| 7439 | 3850 |
| 108477 | 5381 |
addmargins(contable)
##
## 0 1 Sum
## 1 65342 1500 66842
## 2 27593 7084 34677
## 3 83571 9558 93129
## 4 7439 3850 11289
## 5 108477 5381 113858
## Sum 292422 27373 319795
prop <-prop.table(contable)
percent <- round(prop*100, 2)
print ("Contingency Table for heart Disease and General Health in Percentage")
## [1] "Contingency Table for heart Disease and General Health in Percentage"
print (percent)
##
## 0 1
## 1 20.43 0.47
## 2 8.63 2.22
## 3 26.13 2.99
## 4 2.33 1.20
## 5 33.92 1.68
barplot(with(heart, table(GeneralHealth, HeartDisease)), main="heart Disease & GeneralHealth", beside=T, col=3:2)
A contingency table was made to examine the relationship between heart disease and general health. For general health level 1 (Excellent), 20.43% of the sample have no heart disease and 0.47% have it; for level 2 (Fair), 8.63% versus 2.22%; for level 3 (Good), 26.13% versus 2.99%; for level 4 (Poor), 2.33% versus 1.20%; and for level 5 (Very good), 33.92% versus 1.68%.
barplot(table(heart$HeartDisease, heart$GeneralHealth), legend= rownames(table(heart$HeartDisease, heart$GeneralHealth)), col =(c("blue", "red")), xlab="GeneralHealth", ylab="heartdisease",main = "Barplot for Heartdisease and GeneralHealth")
A stacked barplot was drawn to show the relationship between GeneralHealth and HeartDisease. From it we can see that people reporting fair or poor general health have a visibly higher proportion of heart disease than the other groups.
barplot(table(heart$HeartDisease, heart$PhysicalActivity), legend= rownames(table(heart$HeartDisease, heart$PhysicalActivity)), col =(c("orange", "blue")), xlab="PhysicalActivity", ylab="heartdisease",main = "Barplot for Heartdisease and PhysicalActivity")
We have a larger number of observations where the person is physically active. The barplot suggests that physical activity has some bearing on whether a person develops heart disease.
Genh_HD=table(heart$GeneralHealth,heart$HeartDisease)
chi_1=chisq.test(Genh_HD)
chi_1
##
## Pearson's Chi-squared test
##
## data: Genh_HD
## X-squared = 21542, df = 4, p-value <2e-16
HD_Eth=table(heart$PhysicalActivity,heart$HeartDisease)  # note: despite the leftover name, this tables PhysicalActivity vs HeartDisease
chi_2=chisq.test(HD_Eth)
chi_2
##
## Chi-squared test for given probabilities
##
## data: HD_Eth
## X-squared = 2e+05, df = 1, p-value <2e-16
For both chi-square tests above, the p-value is below 0.05, so we reject the null hypothesis in each case: both general health and physical activity have a statistically significant relationship with heart disease.
modifieddata<- subset(heart, HeartDisease==1)
modifieddata_1<-head(modifieddata,27373)
nrow(modifieddata_1)
## [1] 27373
modifieddata0<- subset(heart, HeartDisease==0)
modifieddata_01<-head(modifieddata0,27373)
nrow(modifieddata_01)
## [1] 27373
totaldata <- rbind(modifieddata_1, modifieddata_01)
nrow(totaldata)
## [1] 54746
summary(totaldata)
## HeartDisease BMI Smoking AlcoholDrinking Stroke
## Min. :0.0 Min. :12.2 Min. :0 Min. :0 Min. :0
## 1st Qu.:0.0 1st Qu.:24.3 1st Qu.:0 1st Qu.:0 1st Qu.:0
## Median :0.5 Median :27.5 Median :0 Median :0 Median :0
## Mean :0.5 Mean :28.7 Mean :0 Mean :0 Mean :0
## 3rd Qu.:1.0 3rd Qu.:31.9 3rd Qu.:0 3rd Qu.:0 3rd Qu.:0
## Max. :1.0 Max. :83.3 Max. :0 Max. :0 Max. :0
## PhysicalHealth MentalHealth DiffWalking Sex
## Min. : 0.0 Min. : 0.00 Min. :0 Length:54746
## 1st Qu.: 0.0 1st Qu.: 0.00 1st Qu.:0 Class :character
## Median : 0.0 Median : 0.00 Median :0 Mode :character
## Mean : 5.5 Mean : 4.27 Mean :0
## 3rd Qu.: 5.0 3rd Qu.: 4.00 3rd Qu.:0
## Max. :30.0 Max. :30.00 Max. :0
## AgeCategory Race Diabetic
## Length:54746 American Indian/Alaskan Native: 1510 Min. :0
## Class :character Asian : 976 1st Qu.:0
## Mode :character Black : 3684 Median :0
## Hispanic : 4941 Mean :0
## Other : 1831 3rd Qu.:0
## White :41804 Max. :0
## PhysicalActivity GenHealth SleepTime Asthma KidneyDisease
## Min. :0 Excellent: 7576 Min. : 1.00 Min. :0 Min. :0
## 1st Qu.:0 Fair : 9884 1st Qu.: 6.00 1st Qu.:0 1st Qu.:0
## Median :0 Good :17318 Median : 7.00 Median :0 Median :0
## Mean :0 Poor : 4658 Mean : 7.14 Mean :0 Mean :0
## 3rd Qu.:0 Very good:15310 3rd Qu.: 8.00 3rd Qu.:0 3rd Qu.:0
## Max. :0 Max. :24.00 Max. :0 Max. :0
## SkinCancer GeneralHealth Ethnicity
## Min. :0 Min. :1.00 Min. :1.00
## 1st Qu.:0 1st Qu.:2.00 1st Qu.:6.00
## Median :0 Median :3.00 Median :6.00
## Mean :0 Mean :3.19 Mean :5.37
## 3rd Qu.:0 3rd Qu.:5.00 3rd Qu.:6.00
## Max. :0 Max. :5.00 Max. :6.00
set.seed(100)
sample_train_rows <- sample(1:nrow(totaldata), 0.7*nrow(totaldata))
train <- totaldata[sample_train_rows, ]
test <- totaldata[-sample_train_rows, ]
The original data is heavily imbalanced, so we undersampled it to an equal number of yes (1) and no (0) observations, and then split the balanced data into 70% training and 30% testing.
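Taking the first 27,373 rows of each class with head() keeps whatever ordering the file happens to have, which can introduce its own bias. A sketch of random undersampling with sample() instead, using a toy data frame standing in for the real heart data:

```r
set.seed(100)
# Toy data frame standing in for `heart`: 30 positives, 270 negatives
heart_toy <- data.frame(HeartDisease = rep(c(1, 0), times = c(30, 270)),
                        BMI = rnorm(300, mean = 28, sd = 6))
pos <- subset(heart_toy, HeartDisease == 1)
neg <- subset(heart_toy, HeartDisease == 0)
# Draw majority-class rows at random instead of taking the first n with head()
neg_sampled <- neg[sample(nrow(neg), nrow(pos)), ]
balanced <- rbind(pos, neg_sampled)
table(balanced$HeartDisease)   # 30 of each class
```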
model <- glm(HeartDisease ~ GenHealth+SleepTime+Race,data=train,family = "binomial")
summary(model)
##
## Call:
## glm(formula = HeartDisease ~ GenHealth + SleepTime + Race, family = "binomial",
## data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.128 -0.981 -0.391 0.997 2.334
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.48242 0.09414 -26.37 < 2e-16 ***
## GenHealthFair 2.43667 0.04469 54.52 < 2e-16 ***
## GenHealthGood 1.66261 0.03966 41.92 < 2e-16 ***
## GenHealthPoor 3.06311 0.05908 51.84 < 2e-16 ***
## GenHealthVery good 0.77326 0.04046 19.11 < 2e-16 ***
## SleepTime 0.03152 0.00723 4.36 1.3e-05 ***
## RaceAsian 0.10993 0.11410 0.96 0.3353
## RaceBlack 0.42937 0.08135 5.28 1.3e-07 ***
## RaceHispanic -0.23738 0.08061 -2.94 0.0032 **
## RaceOther 0.70207 0.09270 7.57 3.6e-14 ***
## RaceWhite 1.00724 0.07078 14.23 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 53125 on 38321 degrees of freedom
## Residual deviance: 45748 on 38311 degrees of freedom
## AIC: 45770
##
## Number of Fisher Scoring iterations: 4
A model is built on GenHealth, SleepTime, and Race. All p-values are below 0.05 except RaceAsian; the remaining coefficients are statistically significant.
xkabledply(model, title = paste("Logistic Regression :", format(formula(model)) ))
| Estimate | Std. Error | z value | Pr(>|z|) | |
|---|---|---|---|---|
| (Intercept) | -2.4824 | 0.0941 | -26.370 | 0.0000 |
| GenHealthFair | 2.4367 | 0.0447 | 54.519 | 0.0000 |
| GenHealthGood | 1.6626 | 0.0397 | 41.923 | 0.0000 |
| GenHealthPoor | 3.0631 | 0.0591 | 51.845 | 0.0000 |
| GenHealthVery good | 0.7733 | 0.0405 | 19.110 | 0.0000 |
| SleepTime | 0.0315 | 0.0072 | 4.360 | 0.0000 |
| RaceAsian | 0.1099 | 0.1141 | 0.964 | 0.3353 |
| RaceBlack | 0.4294 | 0.0813 | 5.278 | 0.0000 |
| RaceHispanic | -0.2374 | 0.0806 | -2.945 | 0.0032 |
| RaceOther | 0.7021 | 0.0927 | 7.573 | 0.0000 |
| RaceWhite | 1.0072 | 0.0708 | 14.231 | 0.0000 |
All the coefficients except RaceAsian are found significant (small p-values), and all the variables except RaceHispanic have positive effects on HeartDisease.
expcoeff = exp(coef(model))
# expcoeff
xkabledply( as.table(expcoeff), title = "Exponential of coefficients in Logit Reg" )
| x | |
|---|---|
| (Intercept) | 0.0835 |
| GenHealthFair | 11.4350 |
| GenHealthGood | 5.2731 |
| GenHealthPoor | 21.3940 |
| GenHealthVery good | 2.1668 |
| SleepTime | 1.0320 |
| RaceAsian | 1.1162 |
| RaceBlack | 1.5363 |
| RaceHispanic | 0.7887 |
| RaceOther | 2.0179 |
| RaceWhite | 2.7380 |
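These exponentiated coefficients are odds ratios: for a one-unit increase in a predictor, the odds of heart disease multiply by exp(beta), holding the other predictors fixed. For example, the SleepTime estimate of 0.03152 reported earlier reproduces the 1.032 value in this table:

```r
# Odds ratio from a logit coefficient, using the SleepTime estimate above
beta_sleep <- 0.03152
odds_ratio <- exp(beta_sleep)
round(odds_ratio, 3)   # 1.032: each extra hour of sleep multiplies the odds by ~1.03
```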
#loadPkg("regclass")
# confusion_matrix
xkabledply( confusion_matrix(model), title = "Confusion matrix from Logit Model" )
| Predicted 0 | Predicted 1 | Total | |
|---|---|---|---|
| Actual 0 | 13326 | 5936 | 19262 |
| Actual 1 | 5993 | 13067 | 19060 |
| Total | 19319 | 19003 | 38322 |
#unloadPkg("regclass")
Accuracy determines the overall predictive performance of the model. It is calculated as Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives). From the confusion matrix above, accuracy = (13326 + 13067) / 38322, about 68.9%.
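The components of this formula can be read directly off the confusion matrix reported above. A quick check in R:

```r
# Metrics computed from the confusion matrix reported above
TN <- 13326; FP <- 5936    # Actual 0 row
FN <- 5993;  TP <- 13067   # Actual 1 row
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate
round(c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity), 3)
# accuracy 0.689, sensitivity 0.686, specificity 0.692
```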
loadPkg("pROC")
library(pROC)
library(ROCR)
test_prob <- predict(model, newdata = test, type = "response")
train_prob <-predict(model, newdata = train, type = "response")
test_roc <- roc(test$HeartDisease ~ test_prob, plot = TRUE, print.auc = TRUE)
The area under the ROC curve (AUC) is 0.739.
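The AUC has a direct probabilistic reading: it is the probability that a randomly chosen respondent with heart disease receives a higher predicted probability than a randomly chosen respondent without. A minimal sketch of this Mann-Whitney interpretation, using made-up scores rather than the model's actual predictions:

```r
# Mann-Whitney interpretation of AUC, on hypothetical predicted probabilities
probs  <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)  # hypothetical P(HeartDisease)
labels <- c(1,   1,   0,   1,   0,   1,   0,   0)     # hypothetical true outcomes
pos <- probs[labels == 1]
neg <- probs[labels == 0]
grid <- expand.grid(p = pos, n = neg)                 # all positive/negative pairs
auc <- mean(grid$p > grid$n) + 0.5 * mean(grid$p == grid$n)
auc   # 0.8125
```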
fittedheartTrain=ifelse(train_prob<0.5,0,1 )
ConfMatTrain=table(train$HeartDisease,fittedheartTrain)
TrainAccuracy=(ConfMatTrain[1,1]+ConfMatTrain[2,2])/sum(ConfMatTrain)
TrainTPR=72/(72+11)   # hand-entered rounded counts, not derived from ConfMatTrain
TrainFPR=10/(10+79)   # likewise hand-entered
print(cbind(TrainAccuracy, TrainTPR, TrainFPR))
## TrainAccuracy TrainTPR TrainFPR
## [1,] 0.689 0.867 0.112
fittedheartTest=ifelse(test_prob<0.5,0,1 )
ConfMatTest=table(test$HeartDisease,fittedheartTest)
TestAccuracy=(ConfMatTest[1,1]+ConfMatTest[2,2])/sum(ConfMatTest)
TestTPR=45/(45+10)    # hand-entered rounded counts, not derived from ConfMatTest
TestFPR=1-52/(52+7)   # expressed as 1 - true negative rate; counts hand-entered
print(cbind(TestAccuracy, TestTPR, TestFPR))
## TestAccuracy TestTPR TestFPR
## [1,] 0.684 0.818 0.119
Our dataset contains 18 variables. For the EDA we used only a subset of them and drew some preliminary conclusions; we may use all the variables for the final step of building the model.
Telephone interviews (voluntary-response bias): Because of COVID-19, most surveys were conducted by telephone, since many people were unwilling to meet in person and risk exposure for the sake of a survey. The telephone interviews asked whether key indicators such as high blood pressure, high cholesterol, and smoking were present across racial and ethnic groups, to learn what role they play in heart disease. Response quality varied: some respondents, frustrated with lockdowns and being stuck at home, were rude or curt, while others were cooperative and gave relevant answers, which the survey recorded.
Dishonest responses: We are uncertain about the reliability of some responses, because recorded answers sometimes differed between contacts. When respondents were first called for the survey they gave one set of answers, but when called again and told they might receive medical discounts, their answers changed.
During 2021, COVID-19 was at its peak and many people were infected. Post-COVID, many people's immune systems were weakened and they became prone to illnesses such as diabetes and asthma. COVID-19 death rates were very high, and this contributed to higher heart-disease death rates, because many of those infected also had key indicators such as high BMI and high cholesterol. It was not only older people who were affected; death rates among younger people infected with COVID-19 were also high. From the survey, people tended to develop more secondary illnesses, which in turn contributed to the primary disease, heart disease, and to higher death rates.
From the first SMART question, which gender tends to have more heart disease based on BMI, smoking, drinking, and mental health in the year 2020, we concluded that, on average, people with heart disease have a slightly higher BMI than those who don't. The biggest difference, however, is the number of days a respondent feels physically unwell: respondents with heart disease reported feeling unwell about 8 days per month, while healthy respondents reported only 3. Poor mental-health days out of 30 are also slightly higher for people with heart disease than for people without. We then compared the different lifestyle variables between men and women, and further between those with and without heart disease, to ask why men have more heart disease in our sample: are they doing more or less of a certain activity than their female counterparts? For the full sample, the T-test allows us to reject the null hypothesis and conclude that overall average BMI differs between the male and female populations: men have a mean BMI of 28.50 while women have 28.16, a difference of 0.34. Restricted to respondents with heart disease, however, the T-test p-value is 0.604, which is greater than 0.05, so we fail to reject the null and conclude there is no significant difference in average BMI between men and women who have heart disease. For days feeling unwell, the p-value is less than 5%, so there is a difference between men and women: on average, women reported feeling sick for about 9 days while men reported just over 6.
From the second SMART question, are people with heart disease more prone to developing secondary illnesses or is it the other way around, we concluded that secondary illnesses are significantly associated with heart disease. This is based on the chi-squared tests, where we observed p-values < 0.05, allowing us to reject the null hypothesis: every secondary illness (diabetes, asthma, kidney disease, and skin cancer) has a relationship with heart disease.
From the third SMART question, are smokers or drinkers more prone to developing heart disease, we based our analysis on tests rather than assumptions and performed chi-squared tests. For both smoking and drinking, the p-value < 0.05, so we can reject the null hypothesis: both drinking and smoking have a relationship with heart disease.
From all the SMART questions, we conclude that smokers, drinkers, and people with secondary illnesses are more prone to developing heart disease and face higher death rates. People with high BMI are also less healthy and tend to get sick more often.
Next Steps: Our next steps are to use what we learned from the dataset to develop a model that is helpful for everyone. As we all know, health is wealth, and without health we cannot achieve much, so we aim to build a model that identifies which variables contribute to heart disease, so that their effects can be reduced.
To conclude, our EDA found strong evidence that higher BMI, high cholesterol, and age affect heart disease. The relevant variables were Sex, BMI, PhysicalHealth, MentalHealth, SleepTime, secondary illnesses, smoking, and drinking.
Ara S. A literature review of cardiovascular disease management programs in managed care populations. Journal of Managed Care Pharmacy 2004; 10(4): 326-344.
Centers for Disease Control and Prevention. Underlying Cause of Death, 1999–2018. CDC WONDER Online Database. Atlanta, GA: Centers for Disease Control and Prevention; 2018.
CDC. (2019, December 2). Heart Disease Facts | cdc.gov. Centers for Disease Control and Prevention.
Giampaoli S, Palmieri L, Mattiello A, et al. Definition of high risk individuals to optimize strategies for primary prevention of cardiovascular diseases. Nutr Metab Cardiovasc Dis 2005;15:79–85.
Here’s What an Extra 20 Minutes of Sleep Does to Your Brain. (2020, July 17). Well+Good. https://www.wellandgood.com/benefits-of-getting-more-sleep/
Hong, Y. (2018). Framingham Heart Study (FHS) [Review of Framingham Heart Study (FHS)]. National Heart, Lung and Blood Institute. https://www.nhlbi.nih.gov/science/framingham-heart-study-fhs
Liau, S. Y., Mohamed Izham, M. I., Hassali, M. A., & Shafie, A. A. (2010). A literature review of the cardiovascular risk-assessment tools: applicability among Asian population. Heart Asia, 2(1), 15–18. https://doi.org/10.1136/ha.2009.001115
Robbins, R., Quan, S. F., Weaver, M. D., Bormes, G., Barger, L. K., & Czeisler, C. A. (2021). Examining sleep deficiency and disturbance and their risk for incident dementia and all-cause mortality in older adults across 5 years in the United States. Aging, 13(3), 3254–3268. https://doi.org/10.18632/aging.202591
Sadick, B. (2019, April). Women Die From Heart Attacks More Often Than Men. Here’s Why — and What Doctors Are Doing About It. Time; Time. https://time.com/5499872/women-heart-disease/
Centers for Disease Control and Prevention. Conditions That Increase Risk for Stroke. https://www.cdc.gov/stroke/conditions.htm
Soetewey, A. Chi-square test of independence in R. Stats and R. https://statsandr.com/blog/chi-square-test-of-independence-in-r/
Statistics Solutions. Chi-Square. https://www.statisticssolutions.com/free-resources/directory-of-statistical-analyses/chi-square/